Get it at https://dl.dropboxusercontent.com/u/75194/BDF/Olives_At_BDF.ipynb
#!pip install seaborn
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
from IPython.core.display import Image
Image(url="http://qph.is.quoracdn.net/main-qimg-3504cc03d0a1581096eba9ef97cfd7eb?convert_to_webp=true")
The IPython Notebook serves as a process notebook for your research, spanning from your initial playing around to your communicating your results. Use Git(hub) to version all the notebooks you create.
Image(url="https://dl.dropboxusercontent.com/u/75194/BDF/Italy.png")
I found this data set in the RGGobi book (http://www.ggobi.org/book/), from which the above diagram is taken. It has "the percentage composition of fatty acids found in the lipid fraction of Italian olive oils", with oils from 3 regions of Italy: the North, the South, and Sardinia. The regions themselves are subdivided into areas as shown in the map above. The source for this data is:
Forina, M., Armanino, C., Lanteri, S. & Tiscornia, E. (1983), Classification of Olive Oils from their Fatty Acid Composition, in Martens, H. and Russwurm Jr., H., eds, Food Research and Data Analysis, Applied Science Publishers, London, pp. 189–214.
From my friend and co-TF for CS109 (see https://github.com/cs109/content) Chris Beaumont, wonderfully expressed in:
http://nbviewer.ipython.org/github/cs109/content/blob/master/lec_04_wrangling.ipynb
"I'd like to suggest a basic rubric for the early stages of exploratory data analysis in Python. This isn't universally applicable, but it does cover many patterns which recur in several data analysis contexts. It's useful to keep this rubric in mind when encountering a new dataset."
The basic workflow is as follows:
Clean the DataFrame. It should have the following properties:
Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to followup in subsequent analysis."
import requests
r=requests.get("https://dl.dropboxusercontent.com/u/75194/BDF/olive.csv")
print r.status_code
with open("local_olive.csv","w") as f:
    f.write(r.text.encode("utf-8"))
df=pd.read_csv("local_olive.csv")
df.head(5)
type(df), type(df.region)
Image(url="https://dl.dropboxusercontent.com/u/75194/BDF/pandastruct.png")
df.dtypes
We look at the dtypes to see if the columns make sense.
Let's rename that ugly first column. A Google search for 'python pandas dataframe rename' points you at this documentation.
print df.columns
df.rename(columns={df.columns[0]:'areastring'}, inplace=True)
df.columns
Let's explore. Which unique regions and areas are contained in the dataset?
print 'regions\t', df.region.unique()
print 'areas\t', df.area.unique()
pd.value_counts(df.region)
pd.value_counts(df.area)
Let's get rid of the junk numbering in df.areastring. For a single-column pandas Series, we use map. Here's the documentation.
df.areastring=df.areastring.map(lambda x: x.split('.')[-1])
df.head()
rkeys=[1,2,3]
rvals=['South','Sardinia','North']
rmap={e[0]:e[1] for e in zip(rkeys,rvals)}
print rmap
df['regionstring']=df.region.map(lambda x: rmap[x])
df.head()
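As an aside, Series.map also accepts a dict directly, so the lambda above isn't strictly needed. A minimal sketch on a toy Series (the values are made up):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1])
rmap = {1: 'South', 2: 'Sardinia', 3: 'North'}

# Passing a dict to map looks up each value as a key;
# this is equivalent to s.map(lambda x: rmap[x])
mapped = s.map(rmap)
```
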
To access a specific subset of columns of a dataframe, we can use list indexing.
df[['palmitic','oleic']].head()
The above is a DataFrame.
To access the series of entries of a single column, we could do the following.
print df['palmitic']
Remember the acid percentages were ints. They need to be renormalized. Let's do this:
acidlist=['palmitic', 'palmitoleic', 'stearic', 'oleic', 'linoleic', 'linolenic', 'arachidic', 'eicosenoic']
#your code here
dfsub=df[acidlist].apply(lambda x: x/100.0)
dfsub.head()
Notice that we can replace part of the dataframe by this new frame. Since we need the percentages, let's do this. The Oleic percentages should be in the 70s and 80s if you did this right.
df[acidlist]=dfsub
df.head()
df.to_csv("local-olives-cleaned.csv", index=False)
pd.crosstab(df.areastring, df.regionstring)
pd.value_counts(df.areastring, sort=False).plot(kind="bar");
pd.value_counts(df.regionstring, sort=False).plot(kind="barh");
df.describe()
df[acidlist].median().plot(kind="bar");
df.palmitic.hist()
plt.xlabel("% of acid")
plt.ylabel("count of oils")
plt.title("Histogram of Palmitic Acid content");
df.palmitic.hist(bins=30, alpha=0.5);
sns.distplot(df.palmitic);
The first concept we deal with here is pandas groupby. The idea is to group a dataframe by the values of a particular factor variable. The documentation can be found here.
What we do here is a technique called Split, Apply, and Combine.
#split
region_groupby = df.groupby('region')
print type(region_groupby)
region_groupby.head()
The groupby object can also be mapped over. After the mapping is complete, the rows are put back together (reduced) into a larger dataframe. For example, using the describe function:
region_groupby.describe()
One can do bulk functions on the regional group dataframes or series:
region_groupby.eicosenoic.mean()
region_groupby.mean()
Or one can use aggregate to pass an arbitrary function to the sub-dataframe. The function is applied columnwise.
dfbymean=df.groupby("regionstring").aggregate(np.mean)
dfbymean.head()
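To see split-apply-combine in isolation, here is a minimal sketch on a toy frame (the column names are made up): we group, then aggregate each group's columns with an arbitrary function, here the range (max minus min):

```python
import pandas as pd

toy = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                    'val': [1.0, 3.0, 2.0, 6.0]})

# aggregate applies the function columnwise within each group;
# the group keys become the index of the result
ranges = toy.groupby('grp').aggregate(lambda x: x.max() - x.min())
```
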
with sns.axes_style("white", {'grid':False}):
    dfbymean[acidlist].plot(kind='barh', stacked=True);
    sns.despine()
Or one can use apply to pass an arbitrary function to the sub-dataframe. This one takes the dataframe as argument.
region_groupby.apply(lambda f: f.palmitic.mean())
g=sns.FacetGrid(df, col="region")
g.map(plt.scatter,"eicosenoic", "linoleic");
Clearly, region 1 or the South can visually be separated out by eicosenoic fraction itself.
with sns.axes_style("white"):
    g=sns.FacetGrid(df, col="region")
    g.map(sns.distplot, "eicosenoic")
We make a SPLOM using seaborn to see in what space the regions may be separated. Note that linoleic and oleic seem promising. And perhaps arachidic paired with eicosenoic.
sns.pairplot(df, vars=acidlist, hue="regionstring", size=2.5, diag_kind='kde');
Pandas supports conditional indexing: documentation. Let's use it to follow up on the clear pattern of Southern oils seeming to be separable by just the eicosenoic feature.
loweico=df[df.eicosenoic < 0.02]
pd.crosstab(loweico.areastring, loweico.regionstring)
Indeed this is the case! This can also be seen using parallel coordinates:
from pandas.tools.plotting import parallel_coordinates
dfna=df[acidlist]
#normalizing by range
dfna_norm = (dfna - dfna.mean()) / (dfna.max() - dfna.min())
with sns.axes_style("white"):
    parallel_coordinates(df[['regionstring']].join(dfna_norm), 'regionstring', alpha=0.3)
dfsouth=df[df.regionstring=='South']
dfsouth.head()
We make a couple of SPLOMs, one with Sicily and one without, to see what's separable. Sicily seems to be a problem. As before, look at the KDEs first to see if separability exists, and then let the eye look for patterns.
sns.pairplot(dfsouth, hue="areastring", size=2.5, vars=acidlist, diag_kind='kde');
sns.pairplot(dfsouth[dfsouth.areastring!="Sicily"], hue="areastring", size=2.5, vars=acidlist, diag_kind='kde');
Seems that combinations of oleic, palmitic, and palmitoleic might be useful?
So let's do some prediction. After all, we want to test for purity of oils, and for climatic effects.
The usual way we do prediction is to minimize a loss function. This is the most general formulation. These minimizations can come out of generative models like LDA, discriminative models like Logistic Regression, or discriminant models like SVM which have only a cost function but no probabilistic model behind them.
Models may be parametric like logistic regression where we fit for regression coefficients, or non parametric like kNN or decision trees. Amongst those two, the former is a memory based model, where you need to keep the data set in memory to make predictions. Sometimes you keep inner products of points in memory, in the so-called kernelised algorithms.
First, a bit about the structure of scikit-learn:
Some of the following text is taken from the scikit-learn API paper: http://arxiv.org/pdf/1309.0238v1.pdf
All objects within scikit-learn share a uniform common basic API consisting of three complementary interfaces: an estimator interface for building and fitting models, a predictor interface for making predictions and a transformer interface for converting data. The estimator interface is at the core of the library. It defines instantiation mechanisms of objects and exposes a fit method for learning a model from training data. All supervised and unsupervised learning algorithms (e.g., for classification, regression or clustering) are offered as objects implementing this interface. Machine learning tasks like feature extraction, feature selection or dimensionality reduction are also provided as estimators.
An example along these lines:
clf = LogisticRegression()
clf.fit(X_train, y_train)
If one changes classifiers, say, to a Random Forest classifier, one would simply replace LogisticRegression() in the snippet above by RandomForestClassifier().
The predictor interface extends the notion of an estimator by adding a predict method that takes an array X_test and produces predictions for X_test, based on the learned parameters of the estimator. In the case of supervised learning estimators, this method typically returns the predicted labels or values computed by the model. Some unsupervised learning estimators may also implement the predict interface, such as k-means, where the predicted values are the cluster labels.
clf.predict(X_test)
Since it is common to modify or filter data before feeding it to a learning algorithm, some estimators in the library implement a transformer interface which defines a transform method. It takes as input some new data X_test and yields as output a transformed version of X_test. Preprocessing, feature selection, feature extraction and dimensionality reduction algorithms are all provided as transformers within the library.
This is usually done via the fit_transform method. For example, to do a PCA:
pca = RandomizedPCA(n_components=2)
train_x = pca.fit_transform(train_x)
test_x = pca.transform(test_x)
The training set here is "fit" to find the PC components, and then transformed. Since pca.fit() by itself changes the pca object, if we want to transform other data using the same transformation we simply call transform subsequently.
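The same fit-then-transform pattern applies to any transformer. A minimal sketch with StandardScaler (a toy data set I made up): the statistics are learned on the training set only, then reused unchanged on the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_x = np.array([[0.0], [2.0], [4.0]])
test_x = np.array([[2.0]])

scaler = StandardScaler()
train_s = scaler.fit_transform(train_x)  # learns mean/std from train only
test_s = scaler.transform(test_x)        # reuses the same mean/std
```
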
Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays, or in some cases scipy.sparse matrices. The size of the array is expected to be [n_samples, n_features]
#some code taken from https://github.com/jakevdp/ESAC-stats-2014
from IPython.html.widgets import interact
from mpl_toolkits import mplot3d
def plot_3D(X, y, z, elev=30, azim=30):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], z, c=y, s=50)
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('eicosenoic')
    ax.set_ylabel('linoleic')
    ax.set_zlabel('arachidic')

#set azim to -77, elev to 24
def plotter(elev, azim):
    return plot_3D(df[['eicosenoic', 'linoleic']].values, df.region.values, df['arachidic'].values, elev=elev, azim=azim)
interact(plotter, elev=[-90, 90], azim=(-180, 180));
from sklearn.lda import LDA
X=df[acidlist]
y=df.region
sklearn_lda = LDA(n_components=2)
X_lda_sklearn = sklearn_lda.fit_transform(X, y)
plt.scatter(X_lda_sklearn[:, 0], X_lda_sklearn[:, 1], c=y, s=50)
plt.xlabel("LDA1");
plt.ylabel("LDA2");
dfnew=df[['eicosenoic', 'region', 'regionstring']].copy()
dfnew['linoarch']=(0.969/1022.0)*df.linoleic + (0.245/105.0)*df.arachidic
dfnew.head()
plt.scatter(dfnew.linoarch, dfnew.eicosenoic, c=dfnew.region, s=50)
plt.xlabel("linoarch")
plt.ylabel("eicosenoic");
def standardize(l):
    return (l - l.mean()) / l.std()
plt.scatter(standardize(dfnew.linoarch), standardize(dfnew.eicosenoic), c=dfnew.region, s=50);
plt.scatter(standardize(df.linoleic), standardize(df.eicosenoic), c=df.region, s=50)
dfnosouth=df[df.regionstring!='South']
dfnosouth.head()
plt.scatter(standardize(dfnosouth.linoleic), standardize(dfnosouth.arachidic), c=dfnosouth.region, s=50);
From Jesse Johnson's excellent Shape of Data: http://shapeofdata.wordpress.com/2013/05/14/linear-separation-and-support-vector-machines/
The idea is to draw a line in space between the classes. Not just any line, but the line which gives the maximum margin between points of different classes.
The points right at the boundary are called the support vectors, which is where the name comes from.
But what if the separability is not so simple, and there are points intruding?

Then the idea is to minimize the distance of the "crossed over" points from the separating line. These crossed-over points are penalized using "slack" variables. You don't want too many of these.
You obtain the line by minimizing the hinge loss.
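The hinge loss is zero for points comfortably on the correct side of the margin and grows linearly with the amount of crossover. A minimal numpy sketch (labels in {-1, +1}, f the signed decision value; the example numbers are made up):

```python
import numpy as np

def hinge_loss(y, f):
    """max(0, 1 - y*f), elementwise: zero beyond the margin,
    positive for points inside the margin or misclassified."""
    return np.maximum(0.0, 1.0 - y * f)

y = np.array([1, 1, -1])
f = np.array([2.0, 0.5, -3.0])   # signed decision values
losses = hinge_loss(y, f)        # only the point inside the margin is penalized
```
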
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix
#Stolen from Jake's notebooks, above: https://github.com/jakevdp/ESAC-stats-2014
from sklearn.datasets.samples_generator import make_blobs
from sklearn.svm import SVC # "Support Vector Classifier"
def plot_svc_decision_function(clf, ax=None):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    x = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
    y = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
    Y, X = np.meshgrid(y, x)
    P = np.zeros_like(X)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            P[i, j] = clf.decision_function([xi, yj])
    return ax.contour(X, Y, P, colors='k',
                      levels=[-1, 0, 1], alpha=0.5,
                      linestyles=['--', '-', '--'])
def plot_svm(N):
    X, y = make_blobs(n_samples=200, centers=2,
                      random_state=0, cluster_std=0.60)
    X = X[:N]
    y = y[:N]
    clf = SVC(kernel='linear')
    clf.fit(X, y)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')
    plt.xlim(-1, 4)
    plt.ylim(-1, 6)
    plot_svc_decision_function(clf, plt.gca())
    plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
                s=200, facecolors='none')
interact(plot_svm, N=[10, 200]);
Notice how the points that mainly matter are the support vectors. If new such points come in, the prediction can change.
So far we have been working with the entire data set. This is crazy: we have nothing to test on, and furthermore, we might be fitting to the peculiarities of our sample!
X = dfnosouth[['linoleic', 'arachidic']]
Xstd= (X -X.mean())/X.std()
y = (dfnosouth.regionstring.values=='Sardinia')*1
Xtrain, Xtest, ytrain, ytest = train_test_split(Xstd.values,y)
clf = SVC(kernel="linear")
clf.fit(Xtrain, ytrain)
plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=ytrain, s=50, cmap='spring', alpha=0.3)
plot_svc_decision_function(clf, plt.gca())
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
s=200, facecolors='none')
plt.scatter(Xtest[:, 0], Xtest[:, 1], c=ytest, s=50, marker="s", cmap='spring', alpha=0.5);
clf.score(Xtest, ytest)
Often in SVMs one uses the kernel trick, which maps the data from a lower dimension into a higher one to make it separable.
See the illustration (from the above-mentioned book):
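For intuition, the radial (Gaussian) kernel measures a similarity between points that decays with distance. A minimal sketch (gamma plays the role of an inverse bandwidth; the example points are made up):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2): equals 1 when x == z,
    and decays toward 0 as the points separate."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.0, 0.0])
k_same = rbf_kernel(x, np.array([0.0, 0.0]))   # identical points
k_far = rbf_kernel(x, np.array([10.0, 0.0]))   # distant points
```
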

So let's see what using a radial (Gaussian) kernel looks like:
X = dfnosouth[['linoleic', 'arachidic']]
Xstd= (X -X.mean())/X.std()
y = (dfnosouth.regionstring.values=='Sardinia')*1
Xtrain, Xtest, ytrain, ytest = train_test_split(Xstd.values,y)
clf = SVC()
clf.fit(Xtrain, ytrain)
plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=ytrain, s=50, cmap='spring', alpha=0.3)
plot_svc_decision_function(clf, plt.gca())
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
s=200, facecolors='none')
plt.scatter(Xtest[:, 0], Xtest[:, 1], c=ytest, s=50, marker="s", cmap='spring', alpha=0.5);
clf
confusion_matrix(clf.predict(Xtest), ytest)
The SVM is a non-parametric model based on minimizing a cost function. But why not consider a simpler, voting-based model? kNN is an algorithm where, for every point, we find the k nearest neighbors. We then ask them: what are you? We assign to the point the value we get from the majority vote of the neighbors.
The key thing to take from here is the locality of the classification.
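The voting idea can be sketched in a few lines of numpy (a toy, brute-force version with made-up data; scikit-learn's KNeighborsClassifier is what we actually use below):

```python
import numpy as np
from collections import Counter

def knn_predict(Xtr, ytr, x, k=3):
    """Majority vote among the k training points nearest to x."""
    dists = np.sqrt(((Xtr - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(ytr[nearest]).most_common(1)[0][0]

Xtr = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
ytr = np.array([0, 0, 1, 1])
pred = knn_predict(Xtr, ytr, np.array([0.2, 0.2]), k=3)
```
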
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
def points_plot2(X, Xtr, Xte, ytr, yte, clf, colorscale=cmap_light, cdiscrete=cmap_bold):
    h = .02
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    plt.figure(figsize=(10,6))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=0.2)
    plt.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr-1, cmap=cdiscrete, s=50, alpha=0.2, edgecolor="k")
    # and testing points
    yact=clf.predict(Xte)
    plt.scatter(Xte[:, 0], Xte[:, 1], c=yte-1, cmap=cdiscrete, alpha=0.5, marker="s", s=35)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    return plt.gca()
def classify_tt(indf, inacidlist, clon, clf, train_size=0.6):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=indf[clon].values
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=train_size)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data: %0.2f" % (test_accuracy)
    Xtr=np.concatenate((Xtrain, Xtest))
    if len(inacidlist) == 2:
        print "making plot"
        points_plot2(Xtr, Xtrain, Xtest, ytrain, ytest, clf)
    return clf, Xtest, ytest
dfsouthns=dfsouth[dfsouth.areastring!="Sicily"]
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(20, warn_on_equidistant=False)
clf, Xtest, ytest=classify_tt(dfsouthns, ['palmitic','palmitoleic'],'area',clf)
At high k, you are in the bias regime, as you are averaging too many neighbors. At low k you are in the variance regime, as you are very sensitive to local effects. You will overfit in this regime, as you are basically capturing local details rather than any model. We see this tension between locality and bias play out in all classifiers.
def class_with_size(K,T):
    clf = KNeighborsClassifier(K, warn_on_equidistant=False)
    clf, Xtest, ytest=classify_tt(dfsouthns, ['palmitic','palmitoleic'],'area',clf, T)
interact(class_with_size, K=[1,40], T=[0.1,1.0]);
What are we doing in finding all these fits?
The bottom line is: finding the model which has an appropriate mix of bias and variance. We usually want to sit at the point of the tradeoff between the two: be simple but no simpler than necessary.
We do not want a model with too much variance: it would not generalize well. This phenomenon is also called overfitting. There is no point doing prediction if we can't generalize well. At the same time, if we have too much bias in our model, we will systematically under- or over-predict values and miss most predictions. This is also known as underfitting.
Cross-Validation provides us a way to find the "hyperparameters" of our model, such that we achieve the balance point.
Read http://scott.fortmann-roe.com/docs/BiasVariance.html
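A minimal sketch of scoring a single hyperparameter setting by cross-validation, on synthetic data (note: in older scikit-learn the import lives in sklearn.cross_validation, in newer versions in sklearn.model_selection):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated synthetic blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Mean accuracy over 5 folds for one value of the hyperparameter k
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
mean_score = scores.mean()
```

Repeating this over a grid of k values and picking the best is exactly what GridSearchCV automates below.
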
How do we tell that a hypothesis is overfitting? It's not enough that the training error is low, though that's certainly an indication.
The training error is low but test error is high!
If we plot training error against, say, complexity d, the training error will decrease with increasing d. But the cross-validation (or for that matter, test) error curve will have a minimum and go up again. Here the complexity is the inverse of k.

def get_stats(indf, inacidlist, clon):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=indf[clon].values
    print y.shape
    numk=40
    numt=15
    numberOfRows=numk*numt
    df=pd.DataFrame(index=np.arange(0, numberOfRows), columns=('knearest', 'trainsize', 'train_accuracy', 'test_accuracy'))
    counter=0
    for tsize in np.arange(0.2, 0.95, 0.05):
        for k in range(1,numk+1):
            #print k, tsize
            tra=[]
            tea=[]
            for n in range(30):
                Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=tsize)
                clf = KNeighborsClassifier(k, warn_on_equidistant=False)
                clf=clf.fit(Xtrain, ytrain)
                tra.append(clf.score(Xtrain, ytrain))
                tea.append(clf.score(Xtest, ytest))
            training_accuracy=np.mean(tra)
            test_accuracy=np.mean(tea)
            df.loc[counter] = [k, tsize, training_accuracy, test_accuracy]
            counter=counter+1
    return df
dfknn = get_stats(dfsouthns, ['palmitic','palmitoleic'],'area')
dft65=dfknn[(0.63 < dfknn.trainsize) & (dfknn.trainsize < 0.67)]
dft65.head()
plt.plot(dft65.knearest, 1.-dft65.train_accuracy)
plt.plot(dft65.knearest, 1.-dft65.test_accuracy)
Here we plot the train vs. cv/test error as a function of the size of the training set.
The training-set error increases as the size of the data set increases. The intuition is that with more samples, you get further away from the interpolation limit. The cross-validation error, on the other hand, will decrease as the training-set size increases: the more data you have, the better the hypothesis you can fit.
High Bias
Now consider the high bias situation. The training error will increase as before, to a point, and then flatten out. (There is only so much you can do to make a straight line fit a quadratic curve.) The cv/test error, on the other hand, will decrease, but then it too will flatten out. These will be very close to each other, and after a point, getting more training data will not help!

Next consider the high variance situation. The training error will start out very low as usual, and go up slowly: even though we add points, we have enough wiggle room to start with, until it runs out and the error keeps increasing. The cv error, on the other hand, will start out quite high, and remain high. Thus we will have a gap. In this case it will make sense to take more data, as that would drive the cv error down and the training error up, until they meet.
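scikit-learn can compute these train-vs-training-set-size curves directly. A minimal sketch with learning_curve on synthetic data (the function lives in sklearn.model_selection in modern releases; older versions had a sklearn.learning_curve module):

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

# Scores at 5 training-set sizes, each averaged over 5 CV folds
sizes, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5,
    shuffle=True, random_state=0)
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
```

Plotting 1 - train_mean and 1 - test_mean against sizes gives exactly the curves discussed above.
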

dfk40=dfknn[dfknn.knearest==40]
plt.plot(dfk40.trainsize, 1.-dfk40.train_accuracy)
plt.plot(dfk40.trainsize, 1.-dfk40.test_accuracy)
dfk2=dfknn[dfknn.knearest==2]
plt.plot(dfk2.trainsize, 1.-dfk2.train_accuracy)
plt.plot(dfk2.trainsize, 1.-dfk2.test_accuracy)
It's not enough to train on one training set and test on another. Suppose you randomly ended up with a peculiar sample? Furthermore, most models have some other parameters which need to be tuned. For example, kNN has the number of neighbors; the SVM has a regularization parameter 1/C and, if radial, the kernel bandwidth (gamma).
In scikit-learn, there is the concept of a meta-estimator, which behaves quite similarly to standard estimators, but allows us to wrap, for example, cross-validation, or methods that build and combine simpler models or schemes. For example:
from sklearn.multiclass import OneVsOneClassifier
clf=OneVsOneClassifier(LogisticRegression())
In scikit-learn, model selection is supported in two distinct meta-estimators, GridSearchCV and RandomizedSearchCV. They take as input an estimator (basic or composite), whose hyper-parameters must be optimized, and a set of hyperparameter settings to search through.
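RandomizedSearchCV has the same shape as GridSearchCV but samples a fixed number of settings from the grid (or from distributions) instead of trying them all. A minimal sketch on synthetic data (modern scikit-learn puts it in sklearn.model_selection; the older import path differs):

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Try only 5 random settings of n_neighbors instead of all 29
rs = RandomizedSearchCV(KNeighborsClassifier(),
                        param_distributions={'n_neighbors': list(range(1, 30))},
                        n_iter=5, cv=5, random_state=0)
rs.fit(X, y)
best_k = rs.best_params_['n_neighbors']
```

This matters when the grid is large: the cost is n_iter fits-per-fold rather than one per grid point.
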
We now actually go ahead and do the train/test split. Not once but multiple times, in a grid search over different values of C, where C stands for our complexity parameter(s). For each C, we split the training set into n_folds folds, train on n_folds - 1 of these folds, test on the remaining fold, and find the value of C that minimizes the average validation error. Notice the structure of the GridSearchCV estimator in cv_optimize_knn.
from sklearn.grid_search import GridSearchCV
def cv_optimize_knn(X, y, n_folds=10):
    clf = KNeighborsClassifier()
    parameters = {"n_neighbors": range(1,60,1)}
    gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(X, y)
    return gs
def get_optim_classifier_knn(indf, inacidlist, clon):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=indf[clon].values
    #Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.95)
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.8)
    fitted=cv_optimize_knn(Xtrain, ytrain)
    #fitted=cv_optimize(X, y)
    return fitted, Xtest, ytest
thefit, Xte, yte = get_optim_classifier_knn(dfsouthns, ['palmitic','palmitoleic'],'area')
thefit.best_estimator_, thefit.best_params_, thefit.best_score_
thefit.score(Xte, yte)
plt.scatter([e[0]['n_neighbors'] for e in thefit.grid_scores_],[e[1] for e in thefit.grid_scores_])
We alluded earlier to the cost function for the SVM and talked about the regularization parameter C.
The loss is the hinge loss, $\max(0,\; 1 - y_i(w \cdot x_i + b))$.
The actual objective function that soft-margin SVM tries to minimize is
$$\frac{1}{2}\|w\|^2 + C \sum_i \max\left(0,\; 1 - y_i(w \cdot x_i + b)\right).$$
If C is large, the complexity penalty, usually $\frac{1}{2}\|w\|^2$, is dominated by the slack term, and we approach the hard-margin limit.
For hard-margin SVM, the whole objective function is just $\frac{1}{2}\|w\|^2$, subject to the constraints $y_i(w \cdot x_i + b) \ge 1$: since every point satisfies its constraint, the sign flips and the max becomes 0.
def cv_optimize_svm(X, y, n_folds=10, num_p=50):
    #clf = SVC()
    #parameters = {"C": np.logspace(-4, 3, num=num_p), "gamma": np.logspace(-4, 3, num=10)}
    clf = SVC(kernel="linear", probability=True)
    parameters = {"C": np.logspace(-4, 3, num=num_p)}
    gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(X, y)
    return gs
def get_optim_classifier_svm(indf, inacidlist, clon, clonval):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=(indf[clon].values==clonval)*1
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.8)
    #Xtrain, Xtest, ytrain, ytest=X,X,y,y
    fitted=cv_optimize_svm(Xtrain, ytrain)
    return fitted, Xtrain, ytrain, Xtest, ytest
thesvcfit, Xtr, ytr, Xte, yte = get_optim_classifier_svm(dfnosouth, ['linoleic','arachidic'],'regionstring', "Sardinia")
#thesvcfit, Xtr, ytr, Xte, yte = get_optim_classifier_binary(dfsouthns, ['palmitic','palmitoleic'],'area', 3)
thesvcfit.best_estimator_, thesvcfit.best_params_, thesvcfit.best_score_
def plot_svm_new(clf,Xtr,ytr,Xte,yte):
    plt.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr, s=50, cmap='spring')
    plt.scatter(Xte[:, 0], Xte[:, 1], marker='s', c=yte, s=50, cmap='spring')
    #plt.xlim(-1, 4)
    #plt.ylim(-1, 6)
    plot_svc_decision_function(clf, plt.gca())
    plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
                s=200, facecolors='none')
print dict(kernel="linear",**thesvcfit.best_params_)
clsvc=SVC(**dict(kernel="linear",**thesvcfit.best_params_)).fit(Xtr, ytr)
plot_svm_new(clsvc, Xtr, ytr, Xte, yte)
The best fit allows for a bigger margin by allowing some in-between penalization. If we use the standard C=1 in scikit-learn, you see that we are allowing for less penalization.
#X = dfsouthns[['palmitic', 'palmitoleic']]
#y = (dfsouthns.area.values==3)*1
X=dfnosouth[['linoleic', 'arachidic']]
y = (dfnosouth.regionstring.values=='Sardinia')*1
Xstd= (X -X.mean())/X.std()
Xtrain, Xtest, ytrain, ytest = train_test_split(Xstd.values,y)
clf = SVC(kernel="linear")
clf.fit(Xtrain, ytrain)
plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=ytrain, s=50, cmap='spring', alpha=0.3)
plot_svc_decision_function(clf, plt.gca())
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
s=200, facecolors='none')
plt.scatter(Xtest[:, 0], Xtest[:, 1], c=ytest, s=50, marker="s", cmap='spring', alpha=0.5);
clf
clf.score(Xtest, ytest)
thesvcfit.grid_scores_
plt.scatter([e[0]['C'] for e in thesvcfit.grid_scores_],[e[1] for e in thesvcfit.grid_scores_])
plt.xscale("log");
plt.xlim([0.0001, 10000]);
And this goes to the super-hard-margin limit. (For more insight into all this, see http://stats.stackexchange.com/questions/74499/what-is-the-loss-function-of-hard-margin-svm)
#X = dfsouthns[['palmitic', 'palmitoleic']]
#y = (dfsouthns.area.values==3)*1
X=dfnosouth[['linoleic', 'arachidic']]
y = (dfnosouth.regionstring.values=='Sardinia')*1
Xstd= (X -X.mean())/X.std()
Xtrain, Xtest, ytrain, ytest = train_test_split(Xstd.values,y)
clf = SVC(kernel="linear", C=100000)
clf.fit(Xtrain, ytrain)
plt.scatter(Xtrain[:, 0], Xtrain[:, 1], c=ytrain, s=50, cmap='spring', alpha=0.3)
plot_svc_decision_function(clf, plt.gca())
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
s=200, facecolors='none')
plt.scatter(Xtest[:, 0], Xtest[:, 1], c=ytest, s=50, marker="s", cmap='spring', alpha=0.5);
print clf.score(Xtest, ytest)
Image("http://scikit-learn.org/dev/_static/ml_map.png")
Decision trees are very simple things we are all familiar with. If a problem is multi-dimensional, the tree goes dimension by dimension and makes cuts in the space to create a classifier.
def classify_tree(indf, inacidlist, clon, clf, train_size=0.6):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=indf[clon].values
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=train_size)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data: %0.2f" % (test_accuracy)
    return clf, Xtrain, ytrain, Xtest, ytest
#from Jake's ESAC notebook
def visualize_tree(estimator, Xtr, ytr, Xte, yte, boundaries=True,
                   xlim=None, ylim=None):
    estimator.fit(Xtr, ytr)
    if xlim is None:
        xlim = (Xtr[:, 0].min() - 0.1, Xtr[:, 0].max() + 0.1)
    if ylim is None:
        ylim = (Xtr[:, 1].min() - 0.1, Xtr[:, 1].max() + 0.1)
    x_min, x_max = xlim
    y_min, y_max = ylim
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, alpha=0.2, cmap='rainbow')
    plt.clim(ytr.min(), ytr.max())
    # Plot also the training points
    plt.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr, s=50, cmap='rainbow')
    plt.scatter(Xte[:, 0], Xte[:, 1], c=yte, s=50, marker='s', cmap='rainbow')
    plt.axis('off')
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.clim(ytr.min(), ytr.max())
    # Plot the decision boundaries
    def plot_boundaries(i, xlim, ylim):
        if i < 0:
            return
        tree = estimator.tree_
        if tree.feature[i] == 0:
            plt.plot([tree.threshold[i], tree.threshold[i]], ylim, '-k')
            plot_boundaries(tree.children_left[i],
                            [xlim[0], tree.threshold[i]], ylim)
            plot_boundaries(tree.children_right[i],
                            [tree.threshold[i], xlim[1]], ylim)
        elif tree.feature[i] == 1:
            plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k')
            plot_boundaries(tree.children_left[i], xlim,
                            [ylim[0], tree.threshold[i]])
            plot_boundaries(tree.children_right[i], xlim,
                            [tree.threshold[i], ylim[1]])
    if boundaries:
        plot_boundaries(0, plt.xlim(), plt.ylim())
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
clf = DecisionTreeClassifier()
clf, xtr, ytr, xte, yte = classify_tree(dfsouth, ['palmitic','palmitoleic'], 'area', clf)
visualize_tree(clf, xtr, ytr, xte, yte, boundaries=False)
One critical byproduct we get from trees is feature importances:
zip(['palmitic','palmitoleic'],clf.feature_importances_)
Run it again and again and watch the boundary change. This cries overfit. So why use it?
clf = DecisionTreeClassifier()
clf, xtr, ytr, xte, yte = classify_tree(dfsouthns, ['palmitic','palmitoleic'], 'area', clf)
visualize_tree(clf, xtr, ytr, xte, yte, boundaries=False)
We use it to create a random forest from a set of trees. We do two things: train each tree on a bootstrap resample of the data, and consider only a random subset of the features at each split.

When we do this we get trees like the ones seen in the diagram above. What we do now is that, for each region of the space (on some suitable grid), we find out what result each tree gives us. Then we take a majority vote, like in kNN. This process of resampling and averaging over the trees reduces the overfitting a bit. For two features the second step doesn't make a huge amount of sense, but let's try it, as we can see that the overfitting is softened a bit by the averaging:
clf=RandomForestClassifier(n_estimators=100)
clf, Xtrain, ytrain, Xtest, ytest=classify_tree(dfsouthns, ['palmitic','palmitoleic'],'area',clf)
visualize_tree(clf, Xtrain, ytrain, Xtest, ytest, boundaries=False)
Let's run it on the entire set of features.
clf=RandomForestClassifier(n_estimators=100)
clf, Xtrain, ytrain, Xtest, ytest=classify_tree(dfsouthns, acidlist,'area',clf)
zip(acidlist,clf.feature_importances_)
clf=RandomForestClassifier(n_estimators=100)
clf, Xtrain, ytrain, Xtest, ytest=classify_tree(dfsouth, acidlist,'area',clf)
zip(acidlist,clf.feature_importances_)
The out-of-bag (OOB) points each tree never saw in its bootstrap resample give a built-in test-error estimate, so with a random forest we can skip a separate cross-validation, unless there are any weird hyperparams you want to fix. Remember cross-validation does two things: prevents overfitting, but also fits hyperparams.
def classify_tree_on_full(indf, inacidlist, clon, clf):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=indf[clon].values
    clf=clf.fit(X, y)
    training_accuracy = clf.score(X, y)
    test_accuracy = clf.oob_score_
    print "Accuracy on all data: %0.2f" % (training_accuracy)
    print "Accuracy on oob data: %0.2f" % (test_accuracy)
    return clf, test_accuracy
newclf=RandomForestClassifier(oob_score=True, n_estimators=100)
newclf, _=classify_tree_on_full(dfsouth, acidlist, 'area', newclf)
zip(acidlist,newclf.feature_importances_)
accus=[]
myrange=range(10, 250, 10)
for i,n in enumerate(myrange):
    newclf=RandomForestClassifier(oob_score=True, n_estimators=n)
    newclf, a=classify_tree_on_full(dfsouth, acidlist, 'area', newclf)
    accus.append(a)
plt.plot(myrange, accus);
newclf.oob_score_
#doing it multi-dimensionally
thesvcfitmd, Xtr, ytr, Xte, yte = get_optim_classifier_svm(dfsouth, acidlist,'area', 3)
#thesvcfit, Xtr, ytr, Xte, yte = get_optim_classifier_binary(dfsouthns, ['palmitic','palmitoleic'],'area', 3)
thesvcfitmd.best_estimator_, thesvcfitmd.best_params_, thesvcfitmd.best_score_
thesvcfitmd.grid_scores_#regularization worsens
plt.scatter([e[0]['C'] for e in thesvcfitmd.grid_scores_],[e[1] for e in thesvcfitmd.grid_scores_])
plt.xscale("log");
plt.xlim([0.0001, 10000]);
thesvcfitmd.score(Xte, yte)
def classify_tree_on_full_binary(indf, inacidlist, clon, clonval, clf):
    subdf=indf[inacidlist]
    subdfstd=(subdf - subdf.mean())/subdf.std()
    X=subdfstd.values
    y=(indf[clon].values==clonval)*1
    clf=clf.fit(X, y)
    training_accuracy = clf.score(X, y)
    test_accuracy = clf.oob_score_
    print "Accuracy on all data: %0.2f" % (training_accuracy)
    print "Accuracy on oob data: %0.2f" % (test_accuracy)
    return clf, test_accuracy
newclf=RandomForestClassifier(oob_score=True, n_estimators=100)
newclf, _=classify_tree_on_full_binary(dfsouth, acidlist, 'area', 3, newclf)